Part 1: Econometrics
Maximum possible points: 14
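The libraries you will need for the project are the following:
# at least the packages whose functions are called in the code below
library(dplyr) # %>%, filter(), mutate(), group_by(), summarise()
library(stargazer) # stargazer() regression tables
library(ggplot2) # ggplot(), geom_point()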
You will use each of the four econometric approaches seen in class for estimating causal effects to measure the effect of HISP on household health expenditures. Don’t worry about conducting in-depth baseline checks and robustness checks. The purpose is not to gain an in-depth understanding of the variables in the dataset, but to be able to use several methods and explain the results. Also remember that, when reporting the solutions, a numeric answer is not enough! We highly recommend providing an intuitive argument for your answers.
OLS Regression
Preliminary step A: Create a dataset based on HISP that only includes observations from after the intervention (round == "After"). Note that you will use this database for all the OLS and IV exercises.
HISP_after <- read.csv("HISP.csv") %>% filter(round == "After")
HISP_after
1. Build a regression model that estimates the effect of HISP enrollment on health expenditures. You’ll need to use the enrolled_rp variable instead of enrolled, since we’re measuring enrollment after the promotion campaign. Report the regression results in a table. What does this model tell us about the effect of enrolling in HISP?
model1 <- lm(health_expenditures ~ enrolled_rp, data = HISP_after) #linear model m1
stargazer(model1, type = "text") #to check what my model1 gives
##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                         health_expenditures
## -----------------------------------------------
## enrolled_rp                 -12.708***
##                               (0.229)
##
## Constant                     20.587***
##                               (0.124)
##
## -----------------------------------------------
## Observations                   9,914
## R2                             0.237
## Adjusted R2                    0.237
## Residual Std. Error     10.388 (df = 9912)
## F Statistic         3,075.623*** (df = 1; 9912)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
The coefficient on enrolled_rp is negative and statistically significant (p<0.01): households that enrolled in HISP after the random promotion spend on average $12.71 less on out-of-pocket health expenditures than households that did not enroll, whose average is $20.59 (the constant). Because this model includes no controls and enrollment itself is not randomly assigned, the estimate is a difference in group means rather than a clean causal effect.
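To see why the coefficient can be read as a difference in group means, here is a minimal sketch on simulated data. The sample size, enrollment share, and noise level are made up for illustration and are not the HISP data. With a single 0/1 regressor, the OLS intercept equals the mean of the non-enrolled group and the slope equals the difference between the two group means.

```r
# Simulated data only: every number in this block is an illustrative assumption.
set.seed(123)
n <- 1000
enrolled <- rbinom(n, 1, 0.3)                         # hypothetical 0/1 enrollment indicator
health_exp <- 20 - 12 * enrolled + rnorm(n, sd = 10)  # hypothetical expenditures

fit <- lm(health_exp ~ enrolled)

mean_not_enrolled <- mean(health_exp[enrolled == 0])
mean_enrolled <- mean(health_exp[enrolled == 1])

# Intercept next to the mean of the non-enrolled group
print(c(intercept = unname(coef(fit)[1]), group_mean = mean_not_enrolled))
# Slope next to the difference in group means
print(c(slope = unname(coef(fit)[2]), difference = mean_enrolled - mean_not_enrolled))

# Same logic applied to the coefficients reported for model1
implied_enrolled_mean <- 20.587 - 12.708
print(implied_enrolled_mean)
```

Applied to the reported table, the implied average for enrolled households is 20.587 - 12.708 = 7.879 dollars.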
2. Build the same regression model but now include the following control variables: age_hh, age_sp, educ_hh, educ_sp, female_hh, indigenous, hhsize, dirtfloor, bathroom, land, hospital_distance, park, and sports. Report the results in a table.
model2 <- lm(health_expenditures ~ enrolled_rp + age_hh + age_sp + educ_hh + educ_sp + female_hh + indigenous + hhsize + dirtfloor + bathroom + land + hospital_distance + park + sports, data = HISP_after)
stargazer(model2, type = "text")
##
## ===============================================
##                         Dependent variable:
##                     ---------------------------
##                         health_expenditures
## -----------------------------------------------
## enrolled_rp                  -9.815***
##                               (0.213)
##
## age_hh                       0.074***
##                               (0.011)
##
## age_sp                        -0.013
##                               (0.013)
##
## educ_hh                        0.040
##                               (0.042)
##
## educ_sp                       -0.046
##                               (0.045)
##
## female_hh                    1.031***
##                               (0.336)
##
## indigenous                   -2.339***
##                               (0.213)
##
## hhsize                       -2.038***
##                               (0.044)
##
## dirtfloor                    -2.084***
##                               (0.203)
##
## bathroom                     0.687***
##                               (0.195)
##
## land                         0.095***
##                               (0.030)
##
## hospital_distance             -0.004*
##                               (0.002)
##
## park                           0.042
##                               (0.064)
##
## sports                        -0.006
##                               (0.032)
##
## Constant                     29.112***
##                               (0.665)
##
## -----------------------------------------------
## Observations                   9,914
## R2                             0.406
## Adjusted R2                    0.405
## Residual Std. Error      9.173 (df = 9899)
## F Statistic          482.659*** (df = 14; 9899)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
Compared to the model in question 1, is the coefficient of enrollment underestimated or overestimated in Q1?
Once the control variables are included, the coefficient on enrolled_rp shrinks in magnitude (-9.815 vs -12.708, about 22.77% smaller), so the model in Q1 overestimates the size of the reduction in health expenditures.
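The percentage quoted above can be reproduced directly from the two reported coefficients. The short sketch below only re-uses numbers printed in the regression tables; the variable names are introduced here for illustration.

```r
# Coefficients on enrolled_rp copied from the two regression tables
coef_no_controls <- -12.708
coef_with_controls <- -9.815

# How far the estimate moves once controls are added
absolute_change <- coef_with_controls - coef_no_controls
# The same change expressed as a share of the Q1 estimate
relative_change <- absolute_change / abs(coef_no_controls) * 100

print(round(absolute_change, 3))
print(round(relative_change, 2))
```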
Interpret the coefficient of age_hh
It means that if the “head of the household” is one year older, out-of-pocket health expenditures per person per year increase by 0.074 dollars on average, holding the other variables constant (p<0.01).
What is the estimated effect in health expenditures for a household with a private bathroom, holding all other variables constant?
Holding all other variables constant, a household with a private bathroom has health expenditures that are 0.687 dollars higher on average than those of a household without one (p<0.01).
Instrumental Variables Regression
Remark: For the IV exercises, we will use the database created in Preliminary Step A, that only includes observations from after the intervention (round == "After").
Consider the following model:
\(health\_expenditures = \beta_{0} + \beta_{1}\,enrolled\_rp + \beta_{2}\,age\_hh + \beta_{3}\,age\_sp + \beta_{4}\,educ\_hh + \beta_{5}\,educ\_sp + \beta_{6}\,female\_hh + \beta_{7}\,indigenous + \beta_{8}\,hhsize + \beta_{9}\,dirtfloor + \beta_{10}\,bathroom + \beta_{11}\,land + \beta_{12}\,hospital\_distance + \beta_{13}\,park + \beta_{14}\,sports\)
3. Is there a possible endogeneity in one of the variables of the model? If so, why?
Yes, the variable sports could be endogenous with respect to health_expenditures. Depending on
4. If there is endogeneity in one of the variables, find a possible instrument in the data and discuss if it is suitable to correct it.
5. Run a 2SLS regression model using the previous instrument. Report the regression results in a table. After removing the endogeneity issue, what is the causal effect of enrollment in the HISP?
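As a reminder of the mechanics that question 5 asks for, here is a minimal sketch of two-stage least squares done by hand on simulated data. Everything in it is an assumption made for illustration: the instrument `z`, the unobserved confounder `u`, the true effect of -10, and the sample size are invented, and the block neither uses the HISP data nor answers the question.

```r
# Simulated data only: z, u, d, y and all coefficients are hypothetical.
set.seed(2024)
n <- 20000
z <- rbinom(n, 1, 0.5)                    # hypothetical instrument (e.g. a randomized encouragement)
u <- rnorm(n)                             # unobserved confounder
d <- as.numeric(z + u + rnorm(n) > 0.5)   # enrollment depends on both z and u
y <- 20 - 10 * d + 3 * u + rnorm(n)       # true effect of d is -10 by construction

# Naive OLS: biased because d is correlated with the omitted u
beta_ols <- unname(coef(lm(y ~ d))["d"])

# Stage 1: regress the endogenous variable on the instrument
stage1 <- lm(d ~ z)
d_hat <- fitted(stage1)

# Stage 2: regress the outcome on the stage-1 fitted values
stage2 <- lm(y ~ d_hat)
beta_2sls <- unname(coef(stage2)["d_hat"])

print(c(ols = beta_ols, two_sls = beta_2sls))
```

The standard errors printed by the second-stage `lm()` are not valid 2SLS standard errors, because they ignore the first-stage estimation. In practice a dedicated routine is used, for example `ivreg()` from the AER package with a formula of the form `y ~ d | z`.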
Difference-in-Differences
We can estimate the causal effect using a difference-in-difference approach. The data indicate whether households were enrolled in the program (enrolled) and whether they were surveyed before or after the intervention (round), so we can compare the before/after change in health expenditures of enrolled households with the before/after change of non-enrolled households.
Remark: Since we do not have enough data from before the program started to test it, we assume that the two groups shared parallel trends before the treatment. Under this assumption, a Diff-in-Diff estimation is appropriate.
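The comparison described above can be written compactly. In the notation below, which is introduced here for illustration, \(D\) is an indicator for enrolled households, \(T\) is an indicator for the after-intervention round, and \(\bar{Y}_{d,t}\) is the average health expenditure of group \(d\) in round \(t\).

```latex
\[
\hat{\delta}_{DiD} = \left(\bar{Y}_{1,1} - \bar{Y}_{1,0}\right) - \left(\bar{Y}_{0,1} - \bar{Y}_{0,0}\right)
\]
\[
health\_expenditures = \beta_{0} + \beta_{1}\,T + \beta_{2}\,D + \beta_{3}\,(T \times D) + \varepsilon,
\qquad \hat{\delta}_{DiD} = \hat{\beta}_{3}
\]
```

Under the parallel-trends assumption, the second difference removes the change that enrolled households would have experienced anyway, so \(\beta_{3}\) can be read as the effect of enrollment.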
Preliminary step B: Make a new dataset based on HISP that only includes observations from the localities that were randomly chosen for treatment (treatment_locality == "Treatment"). Remark: Use this dataset for the Diff-in-Diff and RDD exercises.
HISP_treatment <- read.csv("HISP.csv") %>% filter(treatment_locality == "Treatment")
HISP_treatment
#distinct(HISP_treatment,treatment_locality) #just to check
#distinct(HISP_treatment,round)
6. Obtain the average of health_expenditures, age_hh, age_sp, educ_hh, educ_sp, female_hh, indigenous, hhsize, dirtfloor, bathroom, land, hospital_distance, park, sports for every time period in both the treatment and control groups (i.e. enrolled and not enrolled), and report them in a table. Analyze the differences in the variables for both treatment and control groups.
HISP_treatment <- HISP_treatment %>%
mutate(enrolled_dummy = ifelse(enrolled == "Enrolled", 1, 0), round_dummy = ifelse(round == "After", 1, 0))
#if enrolled="Enrolled", enrolled_dummy = 1; if round=after, round_dummy=1
HISP_treatment
HISP_treatment %>%
group_by(round, enrolled) %>%
# mean of each listed variable within every round x enrollment group
summarise(across(c(health_expenditures, age_hh, age_sp, educ_hh, educ_sp, female_hh, indigenous, hhsize, dirtfloor, bathroom, land, hospital_distance, park, sports), mean))
## `summarise()` has grouped output by 'round'. You can override using the `.groups` argument.
Most variables differ little across groups and rounds; the clear exceptions are health_expenditures and age. Households not enrolled in the HISP program are older on average. One possible explanation is that household wealth tends to rise with age until retirement, so older households are less likely to be eligible for enrollment in the HISP program (https://www.federalreserve.gov/publications/files/scf20.pdf, page 7). Enrollment is associated with markedly lower health expenditures, and the gap widens after the intervention: households not enrolled in the HISP program after the intervention have about three times the health expenditures of enrolled households ($22.30 vs $7.84).
7. Run a regression model that estimates the difference-in-difference effect of being enrolled in the HISP program. Report the regression results in a table. What is the causal effect of HISP on health expenditures?
modeldif <- lm(health_expenditures ~ round_dummy + enrolled_dummy + round_dummy*enrolled_dummy, data = HISP_treatment)
stargazer(modeldif, type = "text")
##
## ======================================================
##                                Dependent variable:
##                            ---------------------------
##                                health_expenditures
## ------------------------------------------------------
## round_dummy                         1.513***
##                                      (0.251)
##
## enrolled_dummy                      -6.302***
##                                      (0.229)
##
## round_dummy:enrolled_dummy          -8.163***
##                                      (0.324)
##
## Constant                            20.791***
##                                      (0.177)
##
## ------------------------------------------------------
## Observations                          9,919
## R2                                    0.344
## Adjusted R2                           0.343
## Residual Std. Error             7.913 (df = 9915)
## F Statistic                1,730.145*** (df = 3; 9915)
## ======================================================
## Note:                      *p<0.1; **p<0.05; ***p<0.01
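The difference-in-difference estimate is -8.163 (the round_dummy:enrolled_dummy row, p<0.01). It means that enrolling in the HISP reduces health expenditures: compared with the change experienced by non-enrolled households over the same period, the health expenditures of enrolled households are now about $8.16 lower.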
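The interaction coefficient can be checked against the four group averages implied by the regression. The sketch below only re-uses the rounded coefficients printed in the table above; the helper `cell_mean()` and the vector names are introduced here for illustration.

```r
# Rounded coefficients copied from the difference-in-difference table
b <- c(intercept = 20.791, round_dummy = 1.513,
       enrolled_dummy = -6.302, interaction = -8.163)

# Predicted mean expenditure for a given enrollment status and round (0/1 each)
cell_mean <- function(enrolled, after) {
  unname(b["intercept"] + b["round_dummy"] * after +
           b["enrolled_dummy"] * enrolled + b["interaction"] * enrolled * after)
}

means <- c(not_enrolled_before = cell_mean(0, 0),
           not_enrolled_after  = cell_mean(0, 1),
           enrolled_before     = cell_mean(1, 0),
           enrolled_after      = cell_mean(1, 1))
print(round(means, 3))

# Change among the enrolled minus change among the non-enrolled
did <- unname((means["enrolled_after"] - means["enrolled_before"]) -
                (means["not_enrolled_after"] - means["not_enrolled_before"]))
print(round(did, 3))
```

The implied change is -6.650 for enrolled households and +1.513 for non-enrolled households, and the difference between the two is the interaction coefficient, -8.163.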
#just to have a clearer point of view with graphs
HISP_treatment$enrolled <- as.factor(HISP_treatment$enrolled)
HISP_treatment$round <- as.factor(HISP_treatment$round)
ggplot(HISP_treatment, mapping = aes(x = round, y = health_expenditures, color = enrolled)) +
geom_point(size = 2, alpha = 0.4)
cdplot(enrolled ~ health_expenditures , data=HISP_treatment)
cdplot(round ~ health_expenditures , data=HISP_treatment)
8. Run a second model that estimates the difference-in-difference effect, but control for the following variables:
age_hh, age_sp, educ_hh, educ_sp, female_hh, indigenous, hhsize, dirtfloor, bathroom, land, hospital_distance, park, sports. Report the regression results in a table. How does the causal effect change?
modeldif2 <- lm(health_expenditures ~ enrolled_dummy + round_dummy + enrolled_dummy * round_dummy + age_hh + age_sp + educ_hh + educ_sp + female_hh + indigenous + hhsize + dirtfloor + bathroom + land + hospital_distance + park + sports, data = HISP_treatment)
summary(modeldif2)
##
## Call:
## lm(formula = health_expenditures ~ enrolled_dummy + round_dummy +
##     enrolled_dummy * round_dummy + age_hh + age_sp + educ_hh +
##     educ_sp + female_hh + indigenous + hhsize + dirtfloor + bathroom +
##     land + hospital_distance + park + sports, data = HISP_treatment)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.887  -1.851  -0.265   0.914  80.581
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 27.379043   0.496421  55.153  < 2e-16 ***
## enrolled_dummy              -1.512755   0.208989  -7.238 4.88e-13 ***
## round_dummy                  1.450705   0.207301   6.998 2.76e-12 ***
## age_hh                       0.080491   0.008191   9.827  < 2e-16 ***
## age_sp                      -0.019726   0.009279  -2.126 0.033533 *
## educ_hh                      0.060032   0.029811   2.014 0.044063 *
## educ_sp                     -0.076499   0.032416  -2.360 0.018299 *
## female_hh                    1.103694   0.241104   4.578 4.76e-06 ***
## indigenous                  -2.312066   0.147543 -15.670  < 2e-16 ***
## hhsize                      -1.994700   0.033031 -60.388  < 2e-16 ***
## dirtfloor                   -2.299754   0.145472 -15.809  < 2e-16 ***
## bathroom                     0.499839   0.138913   3.598 0.000322 ***
## land                         0.090874   0.021580   4.211 2.56e-05 ***
## hospital_distance           -0.003190   0.001675  -1.905 0.056858 .
## park                         0.002208   0.045377   0.049 0.961198
## sports                       0.001727   0.022963   0.075 0.940039
## enrolled_dummy:round_dummy  -8.161552   0.268012 -30.452  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.544 on 9902 degrees of freedom
## Multiple R-squared:  0.5516, Adjusted R-squared:  0.5509
## F-statistic: 761.3 on 16 and 9902 DF,  p-value: < 2.2e-16
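Controlling for these variables leaves the difference-in-difference estimate essentially unchanged: the interaction coefficient is -8.162 with controls versus -8.163 without. What does change is the coefficient on enrolled_dummy, which shrinks from -6.302 to -1.513, so the controls account for most of the pre-existing gap between enrolled and non-enrolled households. The standard error of the interaction also falls from 0.324 to 0.268.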
Summary
9. Summarize the results from the three methods (OLS, IV and Diff-in-Diff) in the following table, and discuss the advantages and disadvantages of one method of your choice.
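Method <- c("OLS", "OLS", "IV", "Diff-in-Diff", "Diff-in-Diff")
Including_control_variables <- c("No", "Yes", "Yes", "No", "Yes")
# the IV estimate is left as NA because no 2SLS model is fitted above;
# a "0" string here would also coerce every estimate to character
Estimate <- unname(c(coef(model1)["enrolled_rp"], coef(model2)["enrolled_rp"], NA_real_,
coef(modeldif)["round_dummy:enrolled_dummy"], coef(modeldif2)["enrolled_dummy:round_dummy"]))
data.frame(Method, Including_control_variables, Estimate)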
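The diff-in-diff method is useful because it makes it possible to estimate the impact of enrollment in the HISP program ex post, from available data, as if it had been planned as an experiment with control and treatment groups. Its main disadvantage is that it relies on the parallel-trends assumption, which, as noted above, cannot be checked here because there is not enough data from before the program started.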
Regression Discontinuity Design
Eligibility for the HISP is determined by income. Households that have an income of less than 58 on a standardized 1-100 scale (poverty_index) qualify for the program and are automatically enrolled. Because we have an arbitrary cutoff in a running variable, we can use regression discontinuity to measure the effect of the program on health expenditures.
10. Why is choosing the bandwidth important in regression discontinuity? What are the pros and cons of choosing a small/large bandwidth?
RD compares two groups that are very similar except for the treatment, because treatment status changes discontinuously at the cutoff of the running variable (poverty_index here) (Green et al., 2009). The bandwidth must therefore stay close to the cutoff (a local strategy), so that the differences observed between the groups are attributable to the treatment and not to differences in the characteristics of the households. For instance, we can assume there is little to no difference in characteristics between households with a poverty_index of 58 and households with a poverty_index of 57, even though the former are not eligible for the HISP program. The health_expenditures comparison on either side of the cutoff therefore has high internal validity.
There is a tradeoff. A smaller bandwidth reduces bias, because the households compared are more alike and a local linear regression fits well close to the cutoff, but it uses less data, omits potentially valuable cases, and may generate estimates that are too uncertain to be useful. A larger bandwidth increases precision because there is more data, but it brings in households far from the cutoff, so the results could be explained not only by the discontinuity but also by the characteristics of the households.
The literature gives insights into the optimal bandwidth choice (Imbens et al., 2009).
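The tradeoff can be illustrated with a minimal simulation. Everything in the sketch below is an assumption made for illustration: the data are simulated rather than taken from HISP, the outcome follows a made-up curved trend with a jump of -8 at a cutoff of 58, and `rd_estimate()` is a hypothetical helper that fits a local linear regression with separate slopes on each side of the cutoff.

```r
# Simulated data only: the trend, the jump of -8 and the noise level are made up.
set.seed(42)
n <- 5000
cutoff <- 58
sim <- data.frame(poverty_index = runif(n, 20, 100))
sim$treated <- as.numeric(sim$poverty_index < cutoff)
sim$centered <- sim$poverty_index - cutoff
sim$y <- 15 + 0.1 * sim$centered + 0.0002 * sim$centered^3 -
  8 * sim$treated + rnorm(n, sd = 3)

# Local linear RD estimate using only observations within h of the cutoff
rd_estimate <- function(h, data) {
  window <- data[abs(data$centered) <= h, ]
  fit <- lm(y ~ treated * centered, data = window)
  c(bandwidth = h,
    n = nrow(window),
    estimate = unname(coef(fit)["treated"]),
    std_error = summary(fit)$coefficients["treated", "Std. Error"])
}

results <- t(sapply(c(5, 10, 30), rd_estimate, data = sim))
print(round(results, 3))
```

In this setup the narrow window uses few observations and has the largest standard error, while the widest window is the most precise but drifts away from the true jump because a straight line no longer describes the curved trend far from the cutoff.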
11. Suppose that you are including all households with poverty_index in between 53 and 63 (bandwidth=5). Before running an equation, what would you check to make sure that the regression discontinuity method is valid in this case, and why?
We could start with a wider bandwidth (10, for instance), shrink it step by step, and check the distribution of the predetermined characteristics of households by plotting them against poverty_index: they should look increasingly similar on either side of the cutoff as the bandwidth shrinks, with no jump at the cutoff. For health_expenditures, by contrast, we should have visual evidence of a “jump” at the cutoff score (58) and test its statistical significance. We can also place placebo cutoffs elsewhere and check that no jump appears there. The plotting can be done using bins. Finally, we should check whether bandwidth=5 is the bandwidth that minimizes the mean squared error between actual and estimated treatment effects, because the MSE of the estimator captures the bias-variance tradeoff (see https://www.mattblackwell.org/files/teaching/s11-rdd-handout.pdf, slide 52).
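A covariate check of this kind can be sketched as follows. The data are simulated and purely illustrative: `age_hh` here is a hypothetical predetermined characteristic generated to vary smoothly with the poverty index, with no jump at the assumed cutoff of 58. The block computes binned means inside the 53 to 63 window and then runs a balance regression of the covariate on the treatment indicator.

```r
# Simulated data only: this age_hh is generated, not read from HISP.
set.seed(7)
n <- 4000
cutoff <- 58
sim <- data.frame(poverty_index = runif(n, 40, 76))
sim$treated <- as.numeric(sim$poverty_index < cutoff)
sim$centered <- sim$poverty_index - cutoff
sim$age_hh <- 45 + 0.2 * sim$centered + rnorm(n, sd = 8)  # smooth in the index, no jump

# Keep households with poverty_index between 53 and 63 (bandwidth = 5)
window <- sim[sim$poverty_index >= 53 & sim$poverty_index <= 63, ]

# 1. Binned means of the covariate (bins one index point wide) to inspect continuity
bins <- cut(window$poverty_index, breaks = 53:63, include.lowest = TRUE)
bin_means <- tapply(window$age_hh, bins, mean)
print(round(bin_means, 2))

# 2. Balance regression: the coefficient on treated should be close to zero
balance_fit <- lm(age_hh ~ treated * centered, data = window)
balance <- summary(balance_fit)$coefficients["treated", ]
print(round(balance, 3))
```

A coefficient on `treated` that is small relative to its standard error is consistent with households just below and just above the cutoff being comparable on that characteristic. The same regression with health_expenditures as the outcome would instead be expected to show the jump.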